ABSTRACT

Graphs have been commonly used to model many applications.
A natural problem which abstracts applications such as
itinerary planning, playlist recommendation, and flow
analysis in information networks is that of finding the
heaviest path(s) in a graph. More precisely, we can model these
applications as a graph with non-negative edge weights, along
with a monotone function such as sum, which aggregates
edge weights into a path weight, capturing some notion of
quality. We are then interested in finding the top-k heaviest
simple paths, i.e., the $k$ simple (cycle-free) paths with the
greatest weight, whose length equals a given parameter $\ell$.
We call this the Heavy Path Problem (HPP). It is easy to
show that the problem is NP-hard.

In this work, we develop a practical approach to solve the 
Heavy Path problem by leveraging a strong connection with 
the well-known Rank Join paradigm. We first present an 
algorithm by adapting the Rank Join algorithm. We iden- 
tify its limitations and develop a new exact algorithm called 
HeavyPath and a scalable heuristic algorithm. We con- 
duct a comprehensive set of experiments on three real data 
sets and show that HeavyPath outperforms the baseline 
algorithms significantly, with respect to both $\ell$ and $k$. Further,
our heuristic algorithm scales to longer lengths, finding
paths that are empirically within 50% of the optimum solu- 
tion or better under various settings, and takes only a frac- 
tion of the running time compared to the exact algorithm. 

1. INTRODUCTION 

With increasing availability of data on real social and infor- 
mation networks, the past decade has seen a surge in the 
interest in mining and analyzing graphs to enable a variety 
of applications. For instance, the problem of finding dense 
subgraphs is studied in the context of finding communities in 
a social network [7], and the Minimum Spanning Tree (MST) 
problem is used to find teams of experts in a network [12].
Several real applications can be naturally modeled using a
problem that we call the Heavy Path Problem (HPP): given
a weighted graph as input, find the top-k heaviest simple (i.e.,
cycle-free) paths of length $\ell$, where the weight of a path is
defined as a monotone aggregation (e.g., sum) of weights of
the edges that compose the path. Surprisingly, this problem
has received relatively little attention in the database
community.

We present a few concrete applications of HPP below. 

(1) In [5], Hansen and Golbeck motivate a class of novel
applications that recommend collections of objects, for in- 
stance, an application that recommends music (or video) 
playlists of a given length. They emphasize three key proper- 
ties for desirable lists: value of individual items (say songs), 
co-occurrence interaction effects, and order effects including 
placement and arrangement of items. We can abstract a 
graph from user listening history, where a node represents 
a song of high quality as determined by user ratings and an 
edge between a pair of songs exists if they were listened to 
together in one session. The weight on an edge represents 
how frequently the songs were listened to together by users.
Heavy paths in such a graph correspond to playlists of high 
quality songs that are frequently enjoyed together. 

(2) Consider an application that recommends an itinerary 
for visiting a given number of popular tourist spots or points 
of interest (POIs) in a city. The POIs can be modeled as
nodes in a graph, and an edge represents that a pair of POIs 
is directly connected by road/subway link. The weight on an 
edge is a decreasing function of the travel time (or money it 
costs) to go from one POI to another. Heavy paths in such 
a graph correspond to itineraries with small overall travel 
time (or cost) for visiting a given number of popular POIs. 

(3) Say we want to analyze how a research field has evolved.
Given a citation network of research papers and additional 
information on the topics associated with papers (possibly 
extracted from the session title of the conference, or from 
keywords etc.), we can abstract a "topic graph". Nodes in 
a topic graph represent topics, and an edge between two 
topics exists if a paper belonging to one topic cites a paper 
belonging to another topic. The weight on an edge may be 
the normalized frequency of such citations. Heavy paths in 
such a graph capture strong flows of ideas across topics. 

In addition, Bansal et al. [2] modeled the problem of identifying
temporal keyword clusters in blog posts (called the Stable
Clusters Problem in [2]) as that of finding heavy paths in
a graph. In all these applications, we may be interested in
not just the top-1 path, but the top-k heavy paths, for instance,
$k$ itineraries or playlists to choose from. Finding the top-k
heavy paths of a given length $\ell$ is not trivial. The setting
studied in [2] is a special case of HPP where the input
graph was $\ell$-partite (each partition corresponds to a
timestamp) and acyclic. They proposed three algorithms based
on BFS (breadth-first search), DFS (depth-first search), and
TA (the threshold algorithm). Due to the special structure
of their graph, i.e., the absence of cycles and only a subset
of nodes acting as starting points for graph traversal, they
were able to efficiently adapt the BFS and DFS algorithms
and show that these adaptations outperform a TA-based
algorithm. Our problem is more general in that the graphs
are not $\ell$-partite, and typically contain many cycles.
Adaptations of the algorithms proposed in [2] lead to expensive
solutions. For example, an adaptation of DFS would require
performing an $\ell$-deep recursion starting at each node, which is
prohibitively expensive for large general graphs.

We present an efficient algorithm for HPP called HeavyPath,
based on the Rank Join paradigm, as detailed in Section 3.2.
In the last decade, there has been a substantial
amount of work on Rank Join [10, 11, 13, 16, 17]. However,
the experimental results reported have confined themselves
to Rank Join over a small number of relations. In the driving
applications mentioned above, there is often the need to
discover relatively long heavy paths. For example, in playlist
recommendation, a user may be interested in getting a list
containing several tens of songs, and in itinerary planning,
recommendations consisting of 5-15 POIs for the tourist to
visit within a given time interval (e.g., a day or a week) are
useful. In general, it is computationally more demanding to
find top-k heaviest paths as the path length increases. There
is a need for efficient algorithms for meeting this challenge.

The approach we develop in this paper is able to scale to 
longer lengths compared with classical Rank Join. By ex- 
ploiting the fact that the relations being joined are identi- 
cal, we are able to provide smarter strategies for estimating 
the required thresholds, and hence terminate the algorithm
sooner than classical Rank Join. In addition, we carefully 
make use of random accesses in the context of Rank Join as 
an additional means of ensuring that heavy paths are dis- 
covered sooner, and the thresholds are aggressively lowered. 
Finally, it turns out that all exact algorithms considered (in- 
cluding Rank Join and our proposed HeavyPath algorithm) 
run out of memory when the length of the desired path exceeds
10 on some data sets. As we will see in Section 2,
HPP is NP-hard. In order to deal with this, we develop a
heuristic approach that works with the allocated memory 
and allows us to estimate the distance to the optimum solu- 
tion for a given problem instance. We empirically show that 
this heuristic extension scales well even for paths of length 
100 on real data sets. 

We make the following contributions in the paper: 



• We formalize the problem of finding top-k heavy paths
of a given length for general graphs, and establish the
connection between HPP and the Rank Join framework
for constructing heavy paths (Section 2).



• We present a variety of exact algorithms, including
two baselines obtained by adapting known algorithms,
a simple adaptation of Rank Join for computing heavy
paths, and an efficient adaptation of Rank Join called
HeavyPath. With simple modifications, we can turn
HeavyPath into a heuristic algorithm with the nice
property that we can derive an empirical approximation
ratio in a principled manner (Sections 3, 4, and 5).

• We present a comprehensive set of experiments to eval- 
uate and compare the efficiency and effectiveness of 
the different algorithms on three real datasets: last.fm, 
Cora, and Bay that respectively model the three moti- 
vating applications described earlier. Our results show 
that HeavyPath is orders of magnitude faster than 
the baselines and Rank Join, and can find exact so- 
lutions for paths that are several hops longer as com- 
pared with all other algorithms. In addition, our heuris- 
tic algorithm finds paths that are empirically within 
50% of the optimum solution or better under various 
settings, while taking a fraction of the running time 
compared to the exact algorithm (Section 6).

We review the related work in Section 7 and conclude in
Section 8.

2. PROBLEM STUDIED 

Given a weighted graph $G(V, E, W)$, where weight $w_{(u,v)} :=
W(u, v)$ represents the non-negative weight¹ on edge $(u, v) \in E$,
and parameters $k$ and $\ell$, the Heavy Path Problem (HPP)
is to find the top-k heaviest simple paths of length $\ell$, i.e., $k$
simple paths of length $\ell$ with the highest weight. A simple
path of length $\ell$ is a sequence of nodes $P = (v_0, \ldots, v_\ell)$ such
that $(v_i, v_{i+1}) \in E$, $0 \le i < \ell$, and there are no cycles, i.e.,
the nodes $v_i$ are distinct. Unless otherwise specified, in the
rest of the paper, we use the term path to mean simple path.
We note that our framework allows path weights defined
using any monotone aggregate function of edge weights. For
simplicity, we define the weight of a path $P = (v_0, \ldots, v_\ell)$ as
$P.weight = \sum_{j=0}^{\ell-1} w_{(v_j, v_{j+1})}$.
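To make the definition concrete, the following is a minimal Python sketch that checks whether a node sequence is a simple path and computes its weight under the sum aggregate. The edge representation (a dictionary keyed by `frozenset({u, v})`), the toy graph, and the names `is_simple_path` and `path_weight` are assumptions chosen for illustration; they are not taken from the paper.

```python
def is_simple_path(path, weights):
    """True if consecutive nodes are joined by edges and no node repeats."""
    hops = list(zip(path, path[1:]))
    return (len(set(path)) == len(path)
            and all(frozenset(h) in weights for h in hops))


def path_weight(path, weights):
    """Sum aggregate: P.weight is the sum of w(v_j, v_{j+1}) over the path's edges."""
    return sum(weights[frozenset(h)] for h in zip(path, path[1:]))


# Hypothetical undirected toy graph: frozenset({u, v}) -> non-negative weight.
weights = {frozenset({1, 2}): 5, frozenset({2, 3}): 4,
           frozenset({1, 3}): 2, frozenset({3, 4}): 1}

print(path_weight((1, 2, 3, 4), weights))    # weight of a path of length 3
print(is_simple_path((1, 2, 1), weights))    # node 1 repeats, so not simple
```

Any other monotone aggregate could be substituted for `sum` in `path_weight` without changing the rest of the sketch.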

We note that the heavy path problem for a given parameter
$\ell$ is equivalent to the well-known $\ell$-TSP² ($\ell$-Traveling
Salesperson) problem [1], defined as follows: Given a graph with
non-negative edge weights, find a path of minimum weight
that passes through any $\ell + 1$ nodes. It is easy to see that
for a given length $\ell$, a path $P$ is a solution to $\ell$-TSP iff it is
a solution to HPP (with $k = 1$) on the same graph but with
edge weights modified as follows: let $w'_{(u,v)}$ be the weight of
an edge $(u, v)$ in the $\ell$-TSP instance, and $w'_{\max}$ be the
maximum weight of any edge in that instance; then the edge
weight in the HPP instance is $w_{(u,v)} = 1 - w'_{(u,v)} / w'_{\max}$.
It is well-known that TSP and $\ell$-TSP are NP-hard and the
reduction above shows HPP is NP-hard as well. In general,
HPP can be defined for both directed and undirected
graphs. Our algorithms and results focus on the undirected
case, but can be easily extended to directed graphs.

¹ Non-negativity is not a requirement, but is intuitive in most
applications.

² In the literature it is called k-TSP, where $k$ is the given
length of the path. We refer to it as $\ell$-TSP to avoid confusion
with the parameter $k$ used for the number of paths in our top-k
setting.

Figure 1: Example graph for playlist recommendation,
and corresponding edge weights table
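Returning to the reduction from $\ell$-TSP, the following worked step may help show why the reweighting preserves optimal solutions. It writes $w'$ for weights in the $\ell$-TSP instance and $w$ for weights in the HPP instance (a notational choice made here for illustration) and assumes $w'_{\max} > 0$. For any simple path $(v_0, \ldots, v_\ell)$:

$$\sum_{j=0}^{\ell-1} w_{(v_j, v_{j+1})} = \sum_{j=0}^{\ell-1} \left( 1 - \frac{w'_{(v_j, v_{j+1})}}{w'_{\max}} \right) = \ell - \frac{1}{w'_{\max}} \sum_{j=0}^{\ell-1} w'_{(v_j, v_{j+1})}.$$

Since $\ell$ and $w'_{\max}$ are constants for a given instance, a path maximizes the left-hand side exactly when it minimizes the $\ell$-TSP weight. The transformed weights also lie in $[0, 1]$, so the HPP instance keeps non-negative edge weights.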

We next briefly discuss the utility of heavy paths for applications,
compared with cliques. Some data sets, like one that
represents a road network (see the Bay dataset in Section 6),
naturally do not exhibit large cliques. On such data sets,
heavy paths order the items (i.e., nodes) in a sequence
determined by the graph structure. On the other hand, some
data sets may contain large cliques. Figure 1 shows an example
of the co-listening graph for the playlist recommendation
application (this is a subgraph induced over 6 songs from the
last.fm dataset used in our experiments; see Section 6). Say
the task is to find heavy paths of length $\ell = 4$. Even when
the graph is a clique and any permutation of 5 nodes will
result in a path of length 4, the order in which the nodes
are visited can make a significant difference to the overall
weight of the path. Here, the heaviest simple path of length
$\ell = 4$, obtained by visiting nodes in the order 6-1-2-3-4, has a
weight of 3.35. In contrast, a different permutation of the
nodes, 4-3-6-1-2, has a weight of 3.08.

3. FINDING HEAVY PATHS 
3.1 Baselines 

An obvious algorithm for finding the heaviest paths of length
$\ell$ is performing a depth-first search (DFS) from each node,
with the search limited to a depth of $\ell$, while maintaining
the top-k heaviest paths. This is an exhaustive algorithm
and is not expected to scale. A somewhat better approach
is dynamic programming. Held and Karp [9] proposed a
dynamic programming algorithm for TSP that works with
a notion of "allowed nodes" on paths between pairs of nodes,
which we adapt to HPP as follows. For a set of nodes
$S$, we say there is an $S$-avoiding path from node $x$ to $y$
provided none of the nodes on this path other than $x$, $y$ are
in the set $S$. E.g., the path (1,2,3) is {1, 4, 5}-avoiding but
not {2, 4}-avoiding. The idea is to find the heaviest simple
path of length $\ell$ ending at $j$ for every node $j$, and then find
the heaviest among them. To find the heaviest simple path
ending at $j$, we find the heaviest simple $\{j\}$-avoiding path
of length $\ell$ that ends at $j$.
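The exhaustive DFS baseline described at the start of this subsection can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the adjacency-dictionary input, the function name `dfs_top_k`, and the toy graph are assumptions, and node labels are assumed to be mutually comparable so that each undirected path is kept in one orientation only.

```python
import heapq


def dfs_top_k(adj, ell, k):
    """Exhaustive baseline: top-k heaviest simple paths with ell >= 1 edges.

    adj maps node -> {neighbor: weight} for an undirected graph.
    """
    best = []  # min-heap of (weight, path); the lightest kept path is on top

    def visit(path, weight):
        if len(path) == ell + 1:
            # Every undirected path is reached from both ends; keep one orientation.
            if path[0] < path[-1]:
                heapq.heappush(best, (weight, path))
                if len(best) > k:
                    heapq.heappop(best)
            return
        for nxt, w in adj[path[-1]].items():
            if nxt not in path:  # simple paths only
                visit(path + (nxt,), weight + w)

    for start in adj:
        visit((start,), 0)
    return sorted(best, reverse=True)


# Hypothetical toy graph with four nodes.
adj = {1: {2: 5, 3: 2}, 2: {1: 5, 3: 4}, 3: {1: 2, 2: 4, 4: 1}, 4: {3: 1}}
print(dfs_top_k(adj, ell=3, k=2))
```

The search visits every simple path of length $\ell$, which is why it serves only as a baseline.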

The heaviest path of length 1 (starting anywhere) and ending
at a given node $j$ is simply the heaviest edge ending at $j$.
In general, the heaviest (simple) $S$-avoiding path of length
$l$, $2 \le l \le \ell$, ending at $j$ is found by picking the heaviest among
the following set of combinations: concatenate the heaviest
$S \cup \{y\}$-avoiding path (from anywhere) of length $l - 1$ to any
neighbor $y$ of $j$, where $j \in S$ and $y \notin S$, with the edge $(y, j)$.

Figure 2: Adapting Rank Join for HPP

We can apply this idea recursively and easily extend the
dynamic program for finding top-k heaviest paths of a given
length. We skip the details for brevity; the equations and
details of the dynamic programming algorithm can be found
in the appendix. Clearly, both dynamic programming and
DFS have exponential time complexity, although unlike
DFS, dynamic programming aggregates path segments early,
thus achieving some pruning. Both algorithms are used as
baselines in this paper.
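The recursive $S$-avoiding formulation can be illustrated with a small memoized recursion. This is a hedged sketch in the spirit of the dynamic program described above, not the algorithm from the appendix: it returns only the weight of the single heaviest path, uses a length-0 base case for brevity, and the name `heaviest_path_weight` and the toy graph are assumptions.

```python
from functools import lru_cache


def heaviest_path_weight(adj, ell):
    """Weight of the heaviest simple path with ell edges (top-1 only)."""

    @lru_cache(maxsize=None)
    def best(j, length, avoid):
        # Heaviest path with `length` edges ending at j whose other nodes
        # stay outside `avoid` (j itself is in `avoid`); -inf if none exists.
        if length == 0:
            return 0
        result = float("-inf")
        for y, w in adj[j].items():
            if y not in avoid:
                # Extend the heaviest (avoid + {y})-avoiding path ending at y
                # with the edge (y, j).
                result = max(result, best(y, length - 1, avoid | {y}) + w)
        return result

    return max(best(j, ell, frozenset({j})) for j in adj)


# Hypothetical toy graph with four nodes.
adj = {1: {2: 5, 3: 2}, 2: {1: 5, 3: 4}, 3: {1: 2, 2: 4, 4: 1}, 4: {3: 1}}
print(heaviest_path_weight(adj, 3))
```

Memoizing on the triple (end node, remaining length, avoided set) lets sub-results be shared, but the number of distinct avoided sets still grows exponentially with $\ell$.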

3.2 Rank Join Algorithm for HPP 

The methods discussed above are unable to prune paths that
have no hope of making it to the top-k. Rank Join [10, 17] is
an efficient algorithmic framework specifically designed for
finding top-k results of a join query. We briefly review it
and discuss how it may be adapted to the HPP problem.

Background. We are given $m$ relations $R_1, \ldots, R_m$. Each
relation contains a score attribute and is sorted in
non-ascending score order. The score of each tuple in the join
$R_1 \bowtie \cdots \bowtie R_m$ is defined as $f(R_1.score, \ldots, R_m.score)$,
where $R_i.score$ denotes the score of the current tuple in
$R_i$ and $f$ is a monotone aggregation function. The problem
is to find the top-k join results for a given parameter
$k$. The Rank Join algorithm proposed by Ilyas et al. [10]
works as follows. (1) Read tuples in sorted order from each
relation in turn. This is called sorted access. (2) For each
tuple read from $R_i$, join it with all tuples from other
relations read so far and compute the score of each result.
Retain the $k$ result tuples with the highest score. (3) Let
$d_i$ be the number of tuples read from $R_i$ so far and let
$t_i^j$ be the tuple at position $j$ in $R_i$. Define a threshold

$$\theta := \max\{ f(t_1^{d_1}.score, t_2^{1}.score, \ldots, t_m^{1}.score), \ldots, f(t_1^{1}.score, \ldots, t_i^{d_i}.score, \ldots, t_m^{1}.score), \ldots, f(t_1^{1}.score, \ldots, t_i^{1}.score, \ldots, t_m^{d_m}.score) \}.$$

This threshold is the maximum possible score
of any future join result. (4) Stop when there are $k$ join
results with score at least $\theta$. It is clear no future results can
have a score higher than that of these $k$ results.

In the rest of the paper, we will assume $f$ is the sum function.
Our results and algorithms carry over to any monotone
aggregation function.

Figure 3: Example instance of HPP. Graph has one
heavy path and n lighter paths.

Adapting Rank Join for HPP. Given a weight-sorted
table $E$ of edges, HPP can be solved using the Rank Join
framework. Indeed, paths of length $\ell$ can be found via an
$\ell$-way self-join of the table $E$, where the join condition can
ensure the end node of an edge matches the start node of the
next edge and cycles are disallowed. In particular, whenever
a new edge is seen from $E$ (under sorted access), it is joined
with every possible combination of $\ell - 1$ edges that have
already been seen to produce paths of length $\ell$. Only the
$k$ paths with the highest weight are retained in the buffer.
Let $d$ be the depth of the edge (tuple) last seen under sorted
access, let $w_d$ denote the weight of the edge seen at
depth $d$, and let $w_{\max}$ be the maximum weight of any edge.
The threshold is updated as $\theta := w_d + (\ell - 1) w_{\max}$. We stop
when $k$ paths of length $\ell$ are found with weight at least $\theta$.
For simplicity, we refer to this adapted Rank Join algorithm
as just Rank Join.
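A minimal sketch of this adaptation is given below, under stated assumptions. The self-join is simulated by enumerating, for each newly seen edge, all simple paths of length $\ell$ that contain it and otherwise use only edges already seen. The helper `extend`, the function `rank_join_hpp`, and the toy graph are hypothetical names and data, and node labels are assumed to be mutually comparable.

```python
import heapq
from collections import defaultdict


def extend(path, weight, steps, adj):
    """Yield (path, weight) for each simple extension of `path` by `steps` edges at its tail."""
    if steps == 0:
        yield path, weight
        return
    for nxt, w in adj[path[-1]].items():
        if nxt not in path:
            yield from extend(path + (nxt,), weight + w, steps - 1, adj)


def rank_join_hpp(edges, ell, k):
    """edges: {(u, v): weight}, undirected. Returns up to k (weight, path) pairs."""
    order = sorted(((w, u, v) for (u, v), w in edges.items()), reverse=True)
    w_max = order[0][0]
    adj = defaultdict(dict)  # only edges seen so far under sorted access
    top = []                 # min-heap holding the k heaviest paths found
    for w_d, u, v in order:  # sorted access
        adj[u][v] = adj[v][u] = w_d
        for a in range(ell):  # a edges beyond u, ell - 1 - a edges beyond v
            for half, hw in extend((v, u), w_d, a, adj):
                for full, fw in extend(half[::-1], hw, ell - 1 - a, adj):
                    heapq.heappush(top, (fw, min(full, full[::-1])))
                    if len(top) > k:
                        heapq.heappop(top)
        theta = w_d + (ell - 1) * w_max  # bound on any path not yet produced
        if len(top) == k and top[0][0] >= theta:
            break
    return sorted(top, reverse=True)


# Hypothetical toy graph with four nodes.
edges = {(1, 2): 5, (2, 3): 4, (1, 3): 2, (3, 4): 1}
print(rank_join_hpp(edges, ell=2, k=2))
print(rank_join_hpp(edges, ell=3, k=1))
```

Each path is produced exactly once, at the step where its lightest edge arrives under sorted access. On this toy graph the search for $\ell = 3$ reads every edge before it can stop, because the threshold stays above the weight of the best path.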

Consider the example graph in Figure 1. A Rank Join
performed on 4 copies of the edge weight table, with an
appropriately defined join condition, can be used to compute
top-k heavy paths of length $\ell = 4$. The Rank Join
algorithm proceeds by scanning the edge table in sorted
order of the edge weights. Figure 2(a) illustrates a snapshot
during the execution, where the algorithm has "seen"
or scanned 4 tuples from the edge table, and therefore the
depth $d = 4$. Also, the weight of the edge seen at depth 4 is
$w_d = 0.77$ and $w_{\max} = 0.93$. At this depth, the threshold
is updated as $\theta = 0.77 + (4 - 1) \times 0.93 = 3.56$. For this
particular example, Rank Join scans all edges in the table
before being able to output the top-1 path of length $\ell = 4$, as the
weight of the top path (i.e., 3.35) is still below the threshold
$\theta = 0.58 + (4 - 1) \times 0.93 = 3.37$ even at depth $d = |E|$.
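The threshold arithmetic in this example can be reproduced with a few lines of Python. The function name `rank_join_threshold` is an assumption made for illustration; the numbers are the ones quoted in the example.

```python
def rank_join_threshold(w_d, w_max, ell):
    """Threshold used by the adapted Rank Join: w_d + (ell - 1) * w_max."""
    return w_d + (ell - 1) * w_max


theta_at_depth_4 = rank_join_threshold(0.77, 0.93, 4)
theta_at_last_edge = rank_join_threshold(0.58, 0.93, 4)

# Rounded to two decimals to hide floating-point noise.
print(round(theta_at_depth_4, 2), round(theta_at_last_edge, 2))

# The heaviest path weighs 3.35, so the stopping test fails even at the last edge.
print(3.35 >= theta_at_last_edge)
```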

4. LIMITATIONS AND OPTIMIZATIONS 

In this section, we make some observations about the
limitations of using Rank Join for finding top-k heavy paths
in general graphs and discuss possible optimizations. We
establish some key properties of (the above adaptation of)
Rank Join, which will pave the way for a more efficient
algorithm in Section 5.



OBSERVATION 1. Let $P$ be a path of length $\ell$ and
suppose $e$ is the lightest edge on $P$, i.e., its weight is the
least among all edges in $P$. Then until $e$ is seen under sorted
access, the path $P$ will not be constructed by Rank Join.

We illustrate this observation with an example graph in
Figure 3 for the task of finding the heaviest path of length 3.
The graph consists of one path of weight 2.001 (the top path)
and $n$ other paths of weight 0.03 + 0.02 + 0.01 = 0.06, all
of length 3. Clearly, the top path (call it $P$) is the unique
heaviest path of length 3. However, Rank Join must wait
for the edge with weight 0.001 (the lightest in the graph) to
be seen before it can find and report $P$. Until then, it would
be forced to produce (and discard!) the $n$ paths of length
3, each with weight 0.06. Since $n$ can be arbitrarily large,
Rank Join is forced to construct an arbitrarily large number
of paths, most of which are useless. Indeed, as an extreme
case, we can show the following result on Rank Join, for the
special case $k = 1$.

LEMMA 1. Consider an instance of HPP where the weight
of the heaviest path of length $\ell$ is smaller than
$w_{\min} + (\ell - 1) \times w_{\max}$, where $w_{\max}$ and $w_{\min}$ are the
weights of the heaviest and lightest edge in the graph,
respectively. For this instance, Rank Join produces every
path of length $\ell$.

Proof. The Rank Join algorithm stops when there is a
path of length $\ell$ in the buffer whose weight is not smaller
than the threshold. Assume the heaviest path $P$ is lighter
than $w_{\min} + (\ell - 1) \times w_{\max}$, and Rank Join stops after seeing
an edge $e$ with weight $w > w_{\min}$. In this case, the threshold
is $\theta = w + (\ell - 1) w_{\max} > w_{\min} + (\ell - 1) w_{\max} > P.weight$, so
by definition, Rank Join cannot terminate, a contradiction!
This shows that it must see all edges with weight $w_{\min}$ before
halting. By this time, by definition, Rank Join will have
produced all paths of length $\ell$. □

The observation and lemma motivate the following
optimization for finding heaviest paths.

OPTIMIZATION 1. We should try to avoid delaying
the production of a path of a certain length until the lightest
edge on the path is seen under sorted access. One possible
way of making this happen is via random access. However,
random accesses have to be done with care in order to keep
the overhead low.

Figure 2(b) shows that an edge (3,4) that joins with the
edge (2,3) can be accessed sooner by using random access.
It is worth noting that Ilyas et al. [10] mention the value of
random access for achieving potentially tighter thresholds.
Of course, the cost of random access is traded for the pruning
power of the improved threshold. Indeed, in recent work
Martinenghi and Tagliasacchi [14] study the role of random
access in Rank Join from a cost-based perspective.

Following this optimization, suppose we use random accesses
to find "matching" edges with which to extend heaviest paths
of length $\ell - 1$ to length $\ell$. This is a good heuristic, but
heaviest paths of length $\ell - 1$ may not always lead to heaviest
paths of length $\ell$. A natural question is whether the use of
random accesses in this manner will lead to a performance
penalty w.r.t. Rank Join. Our next result shows that Rank
Join will produce heaviest paths of length $\ell - 1$ before it
produces and reports the heaviest path of length $\ell$.



LEMMA 2. Run Rank Join on an instance of HPP first
with input length $\ell - 1$, and then with input length $\ell$.
Suppose $d'$ (resp., $d$) is the depth at which Rank Join finds the
heaviest path of length $\ell - 1$ (resp., $\ell$) for the first time in
the respective runs. Then $d \ge d'$. Thus, by the time Rank
Join produces the heaviest path of length $\ell$, it also produces
the heaviest path of length $\ell - 1$.

Proof. Suppose $P$ is the heaviest path of length $\ell$ and $Q$
is the heaviest path of length $\ell - 1$. Suppose $w_{d'}$ and $w_d$ are
the edge weights at depths $d'$ and $d$, resp., in the edge table
$E$. Assume $d < d'$. We know that
$w_{d'} + (\ell - 2) \times w_{\max} \le Q.weight < w_{d'-1} + (\ell - 2) \times w_{\max}$
and $w_d + (\ell - 1) \times w_{\max} \le P.weight$. Since $d \le d' - 1$ and
the edges are sorted, $w_d \ge w_{d'-1}$. Therefore,
$P.weight - w_{\max} \ge w_d + (\ell - 2) \times w_{\max} \ge w_{d'-1} + (\ell - 2) \times w_{\max} > Q.weight$.
Now, $P.weight - w_{\max}$ is a lower bound on the heaviest sub-path
of $P$ of length $\ell - 1$. This means $P$ has a sub-path of length
$\ell - 1$ that is at least as heavy as $Q$, and can be returned by
Rank Join at depth $d < d'$. This contradicts the fact that $Q$
is the first heaviest path of length $\ell - 1$ discovered by Rank
Join. Thus, the heaviest path of length $\ell - 1$ is found no
later than the heaviest path of length $\ell$. □

³ Ties are broken arbitrarily.

Figures 3 and 4 illustrate the notions of depth and edge
weight at a given depth. The leftmost table in Figure 4
represents the sorted edge list for the graph in Figure 3.
While computing the heaviest path of length $\ell = 3$, Rank
Join obtains the heaviest path of length $\ell - 1$ (i.e., 2) at depth $d'$.
Without random accesses, the heaviest path of length $\ell = 3$
is obtained after scanning edge $(c, d)$ at depth $d$.

The real significance of the above lemma is that by the time
Rank Join reports a heaviest path of length $\ell$, it has already
seen heaviest paths of length $\ell - 1$. Thus, any heaviest paths
of length $\ell - 1$ we try to extend to length $\ell$ (using random
access) are necessarily produced by Rank Join as well, albeit
later than if we were to use no random access.



OBSERVATION 2. Rank Join uses a very conservative
threshold $w_d + (\ell - 1) w_{\max}$, by supposing the new edge
seen may join with $\ell - 1$ edges with maximum weight. In
many cases, the new edge just read may not join with such
heavy edges at all.

One way of making the threshold tighter is by keeping track
of shorter paths. For example, if we know $P$ is a heaviest
path of length $\ell - 1$, we can infer that the heaviest path of
length $\ell$ cannot be heavier than $P.weight + w_{\max}$, a bound
often much tighter than $w_d + (\ell - 1) w_{\max}$⁴. For this, the
(heaviest) paths of length $\ell - 1$ have to be maintained.
Pursuing this idea recursively leads to a framework where we
maintain heaviest paths of each length $i$, $2 \le i \le \ell$. More
precisely, we can perform the following optimization.

OPTIMIZATION 2. Maintain a buffer $B_i$ for the heaviest
paths of length $i$, and a threshold $\theta_i$ on the weight of all
paths of length $i$ that may be produced in the future, based on
what has been seen so far. When a new heaviest path $P$ of
length $i - 1$ is found, i.e., $P.weight > \theta_{i-1}$, update the
threshold $\theta_i$ for buffer $B_i$ as
$\theta_i = \max(\theta_{i-1}, B_{i-1}.topScore) + w_{\max}$,
where $B_{i-1}.topScore$ is the weight of the heaviest path in
$B_{i-1}$ after the current heaviest path $P$ with $P.weight > \theta_{i-1}$
is removed from $B_{i-1}$⁵. The rationale is that, in the best
case, a heaviest future path of length $i$ may be obtained by
joining the current heaviest path in $B_{i-1}$ with an edge that
has the maximum weight, or joining a future path in $B_{i-1}$
with such an edge.

⁴ Our algorithm makes use of an even tighter threshold as
described shortly.

⁵ $B_{i-1}.topScore$ may be greater or less than $\theta_{i-1}$.

Figure 4: HeavyPath using buffers to extend paths
for Figure 3

Figure 4 schematically describes the idea of using buffers
for different path lengths. Buffer $B_1$ is the sorted edge list
for the graph in Figure 3. Buffers $B_2$ and $B_3$ store paths
of length 2 and 3 respectively. When random accesses are
performed to extend paths of length $l - 1$ to those of length
$l$, the buffers can be used to store these intermediate paths.
For instance, when the heaviest path of length 1 is seen, say
edge $(a, b)$ is seen, it can be extended to paths of length 2 by
accessing edges connected to its end points, as represented
in $B_2$ in Figure 4. Similarly, the heaviest path of length 2
can be extended by an edge using random access to obtain
path(s) in buffer $B_3$.

OBSERVATION 3. Rank Join, especially when performed
as a single $\ell$-way operation, tends to produce the same
sub-path multiple times. For example, when Rank Join is
required to produce the heaviest path of length 3 for the graph
in Figure 3, for paths $(a', b', c', d'^{i})$, $i \in [1, n]$, it produces the
length-2 sub-path $(a', b', c')$ $n$ times as it does not maintain
shorter path segments.

This observation motivates the following optimization.

OPTIMIZATION 3. Shorter path segments can be maintained
so they are created once and used multiple times when
extended with edges to create longer paths. For instance, in
Figure 3, we can construct (and maintain) the length-2 path
segment $(a', b', c')$ once and use it $n$ times if needed as edges
are appended to extend it into $n$ different length-3 paths.

A concern surrounding a random access based approach to
extending heaviest paths of length $l - 1$ into paths of length $l$
is that it may result in extending too many such paths. Our
next result relates these paths to those produced by Rank
Join as part of its processing.

LEMMA 3. Let $P$ be the heaviest path of length $\ell$. When
Rank Join terminates, every path of length $\ell - 1$ that has
weight no less than $P.weight - w_{\max}$ will be created.

Proof. Suppose Rank Join finds $P$ at depth $d$. This
means $P.weight \ge w_d + (\ell - 1) \times w_{\max}$, and
$P.weight - w_{\max} \ge w_d + (\ell - 2) \times w_{\max}$. Notice that
$P.weight - w_{\max}$ is a lower bound on the weight of the
heaviest path of length $\ell - 1$. Therefore, if there is a path of
length $\ell - 1$ that is heavier than $P.weight - w_{\max}$, it will be
produced by Rank Join by depth $d$. □



Algorithm 1 HeavyPath($E$, $\ell$, $k$)

Input: Sorted edge list $E$, path length $\ell$, number of paths $k$

Output: top-k heaviest paths of length $\ell$

for $i = 2$ to $\ell$ do
    $B_i \leftarrow \emptyset$ // empty sorted set
    $\theta_i = w_{\max} \times i$
topPaths $\leftarrow \emptyset$ // empty sorted set
while |topPaths| < $k$ do
    topPaths $\leftarrow$ topPaths $\cup$ NextHeavyPath($E$, $\ell$)



Algorithm 2 NextHeavyPath($E$, $l$)

Input: Sorted list of edges $E$, and path length $l$

Output: Next heaviest path of length $l$

1: if $l = 1$ then
2:     $P^1 \leftarrow$ ReadEdge($E$)
3:     $\theta_2 = 2 \times P^1.weight$
4:     return $P^1$
5: while $B_l.topScore \le \theta_l$ do
6:     $P^{l-1} \leftarrow$ NextHeavyPath($E$, $l - 1$) // recursion
7:     $s, t \leftarrow$ EndNodes($P^{l-1}$)
8:     for all $y \in V \mid (y, s) \in E$ do
9:         $B_l \leftarrow B_l \cup ((y, s) + P^{l-1})$ // avoiding cycles
10:    for all $z \in V \mid (t, z) \in E$ do
11:        $B_l \leftarrow B_l \cup (P^{l-1} + (t, z))$ // avoiding cycles
12: $P^l \leftarrow$ RemoveTopPath($B_l$)
13: if $l < \ell$ then
14:     $\theta_{l+1} = \max(B_l.topScore, \theta_l) + w_{\max}$
15: return $P^l$



The results, observations, and optimizations discussed in 
this section suggest an improved algorithm for finding heav- 
iest paths, which we present next. 

5. HEAVYPATH ALGORITHM FOR HPP 

We start this section by providing an outline of our main
algorithm for solving HPP, and subsequently explore some
refinements. Our algorithm maintains a buffer $B_i$ for storing
paths of length $i$ explored so far, where $2 \le i \le \ell$. Let
threshold $\theta_i$ denote an upper bound on the weight of any
path of length $i$ that may be constructed in the future. Each
buffer $B_i$ is implemented as a sorted set of paths of length $i$,
sorted according to non-increasing weight and without
duplication (e.g., paths $(a, b, c)$ and $(c, b, a)$ are considered
duplicates). Algorithm 1 describes the overall approach. It
takes as input a list of edges $E$ sorted in non-increasing
order of edge weights, and parameters $\ell$ and $k$. It calls the
NextHeavyPath method (Algorithm 2) repeatedly until
the top-k heaviest paths of length $\ell$ are found.

Algorithm 2 describes the NextHeavyPath method. It
takes as input a list of edges $E$ sorted in non-increasing
order of edge weights, and the desired path length $l$, $l \ge 2$.
It is a recursive algorithm that produces heaviest paths of
shorter lengths on demand, and extends them with edges to
produce paths of length $\ell$. The base case for this recursion
is when $l = 1$ and the algorithm reads the next edge from
the sorted list of edges. The ReadEdge method returns the
heaviest unseen edge in $E$ (sorted access, line 2). If $l < \ell$,
the path of length $l$ obtained as a result of the recursion
is extended by one hop to produce paths of length $l + 1$.
Specifically, a path of length $l < \ell$ is extended using edges
(random access, lines 8 and 10) that can be appended to
either one of its ends (returned by method EndNodes). The
"+" operator for appending an edge to a path is defined in a
way that guarantees no cycles are created. The threshold $\theta_l$
is updated appropriately and when it becomes smaller than
the weight of the heaviest path in buffer $B_l$, the next
heaviest path of length $l$ whose weight is greater than the
threshold is returned. This is done by calling the method
RemoveTopPath for buffer $B_l$ and returning the resulting path.
If $l < \ell$ and the next heaviest path of length $l$ has been
obtained, $\theta_{l+1}$ is updated.
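The recursion described above can be sketched in Python as follows. This is an illustrative reading of Algorithms 1 and 2 under stated assumptions, not the authors' implementation. Buffers are binary heaps with a separate duplicate filter, node labels are assumed to be mutually comparable, $\ell \ge 2$ is assumed, and the handling of an exhausted edge list (setting a threshold to negative infinity and returning `None`) is an addition that the pseudocode does not spell out. The names `HeavyPathSketch` and `heavy_paths` and the toy graph are hypothetical.

```python
import heapq
from collections import defaultdict

NEG_INF = float("-inf")


class HeavyPathSketch:
    def __init__(self, edges, ell):
        # edges: {(u, v): weight} for an undirected graph; ell >= 2 is assumed.
        self.ell = ell
        self.sorted_edges = sorted(((w, u, v) for (u, v), w in edges.items()),
                                   reverse=True)
        self.pos = 0                              # sorted-access cursor
        self.w_max = self.sorted_edges[0][0]
        self.adj = defaultdict(dict)              # index used for random access
        for (u, v), w in edges.items():
            self.adj[u][v] = self.adj[v][u] = w
        self.buf = {i: [] for i in range(2, ell + 1)}      # B_i as max-heaps
        self.seen = {i: set() for i in range(2, ell + 1)}  # duplicate filter
        self.theta = {i: self.w_max * i for i in range(2, ell + 1)}

    def _top(self, l):
        return -self.buf[l][0][0] if self.buf[l] else NEG_INF

    def _push(self, l, weight, path):
        key = min(path, path[::-1])   # (a, b, c) and (c, b, a) are duplicates
        if key not in self.seen[l]:
            self.seen[l].add(key)
            heapq.heappush(self.buf[l], (-weight, key))

    def next_heavy_path(self, l):
        if l == 1:                                # base case: sorted access
            if self.pos == len(self.sorted_edges):
                self.theta[2] = NEG_INF           # assumption: no edges left
                return None
            w, u, v = self.sorted_edges[self.pos]
            self.pos += 1
            self.theta[2] = 2 * w
            return w, (u, v)
        while self.theta[l] > NEG_INF and self._top(l) <= self.theta[l]:
            sub = self.next_heavy_path(l - 1)     # recursion
            if sub is None:
                break
            w, p = sub
            for y, wy in self.adj[p[0]].items():  # random access at one end
                if y not in p:
                    self._push(l, w + wy, (y,) + p)
            for z, wz in self.adj[p[-1]].items():  # and at the other end
                if z not in p:
                    self._push(l, w + wz, p + (z,))
        if not self.buf[l]:
            if l < self.ell:
                self.theta[l + 1] = NEG_INF
            return None
        neg_w, path = heapq.heappop(self.buf[l])
        if l < self.ell:
            self.theta[l + 1] = max(self._top(l), self.theta[l]) + self.w_max
        return -neg_w, path


def heavy_paths(edges, ell, k):
    hp = HeavyPathSketch(edges, ell)
    out = []
    while len(out) < k:
        nxt = hp.next_heavy_path(ell)
        if nxt is None:   # fewer than k paths of this length exist
            break
        out.append(nxt)
    return out


# Hypothetical toy graph with four nodes.
edges = {(1, 2): 5, (2, 3): 4, (1, 3): 2, (3, 4): 1}
print(heavy_paths(edges, ell=3, k=2))
```

The duplicate filter is kept for the lifetime of the search, so a path that has already been returned is not re-inserted when its other sub-path is extended later.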

Updating the thresholds. To start, Algorithm 2 explores
the neighborhood of the heaviest edge to find the heaviest
path of length 2. Heavy edges are explored and the
threshold θ_2 is updated until the weight of the heaviest
explored path of length 2 is greater than θ_2. The NextHeavyPath
algorithm updates θ_2 aggressively when the next heaviest
edge is seen. It uses the fact that any path of length
2 created later from currently unseen edges cannot have a
weight greater than 2 × w_d, where w_d is the weight of the
lightest edge seen so far. For l > 1, θ_{l+1} is updated when the next
heaviest path of length l is obtained. Therefore, we can use
θ_l as an upper bound on the weight of any path of length l
that can be created in the future from the buffers B_i, where
i < l. The maximum of θ_l and the weight of the heaviest path
in B_l (i.e., B_l.topScore) provides an upper bound on the
weight of any path of length l that can be created in the
future (after the previous heaviest path of length l). Adding
w_max to the obtained upper bound provides a new (tighter)
threshold for paths of length l + 1.
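The recursion and the threshold updates described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name HeavyPathSketch, the helper top_k_heavy_paths, the dictionary-of-edges input format and the `seen` set used to drop duplicate derivations are all hypothetical choices. The graph is assumed to be undirected with mutually comparable node labels, and no restriction is placed on random accesses.

```python
import heapq

NEG_INF = float("-inf")


class HeavyPathSketch:
    """Illustrative NextHeavyPath-style recursion (hypothetical names throughout)."""

    def __init__(self, edges, ell):
        # edges: {(u, v): weight} for an undirected graph (assumed input format).
        self.adj = {}
        for (u, v), w in edges.items():
            self.adj.setdefault(u, []).append((v, w))
            self.adj.setdefault(v, []).append((u, w))
        self.sorted_edges = sorted(edges.items(), key=lambda kv: -kv[1])
        self.pos = 0                                # cursor for sorted access
        self.w_max = self.sorted_edges[0][1] if self.sorted_edges else 0
        levels = range(2, ell + 1)
        self.buf = {l: [] for l in levels}          # heaps of (-weight, nodes)
        self.seen = {l: set() for l in levels}      # every path created so far
        self.theta = {l: float("inf") for l in levels}

    def _push(self, l, weight, nodes):
        key = min(nodes, nodes[::-1])               # orientation-free identity
        if key not in self.seen[l]:                 # discard duplicate derivations
            self.seen[l].add(key)
            heapq.heappush(self.buf[l], (-weight, key))

    def _extend(self, l, weight, nodes):
        # Random access: append every incident edge at both end nodes.
        for end, at_front in ((nodes[0], True), (nodes[-1], False)):
            for nbr, w in self.adj[end]:
                if nbr not in nodes:                # keep the path simple
                    new = (nbr,) + nodes if at_front else nodes + (nbr,)
                    self._push(l, weight + w, new)

    def next_path(self, l):
        if l == 1:                                  # base case: sorted access
            if self.pos == len(self.sorted_edges):
                return None
            (u, v), w = self.sorted_edges[self.pos]
            self.pos += 1
            return w, (u, v)
        heap = self.buf[l]
        while not heap or -heap[0][0] < self.theta[l]:
            sub = self.next_path(l - 1)
            if sub is None:                         # nothing left to extend
                self.theta[l] = NEG_INF
                break
            weight, nodes = sub
            self._extend(l, weight, nodes)
            if l == 2:
                # Unseen edges weigh at most `weight`, so unseen 2-paths <= 2 * weight.
                self.theta[2] = 2 * weight
            else:
                prev = self.buf[l - 1]
                top = -prev[0][0] if prev else NEG_INF
                self.theta[l] = max(self.theta[l - 1], top) + self.w_max
        if not heap:
            return None
        neg_w, nodes = heapq.heappop(heap)
        return -neg_w, nodes


def top_k_heavy_paths(edges, ell, k):
    """Return up to k (weight, nodes) pairs in non-increasing weight order."""
    hp = HeavyPathSketch(edges, ell)
    out = []
    while len(out) < k:
        path = hp.next_path(ell)
        if path is None:
            break
        out.append(path)
    return out
```

In this sketch a path is released as soon as its weight reaches the threshold, ties included, which is why every created path is remembered in `seen` rather than relying on the threshold alone to suppress duplicates.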

THEOREM 1. Algorithm HeavyPath correctly finds the
top-k heaviest paths of length ℓ.

Proof. The proof is by induction. The base case is for
going from edges to paths of length 2. Given that all of the
edges above depth d are extended, the heaviest path that
can be created from them is already in B_2. The weight of
the heaviest path that can be created from lighter edges is
at most 2 × w_d. If the heaviest path in B_2 is heavier than
2 × w_d, then it must be the heaviest path of length 2.

Assuming the heaviest paths of length l are produced correctly
in sorted order, we show the heaviest path of length
l + 1 is found correctly. Suppose P, the heaviest path of
length l + 1, is created for the first time by extending
Q, which is the n-th heaviest path of length l. The next
heaviest path of length l is either already in B_l or has not
been created yet. Therefore, max(θ_l, B_l.topScore) is an upper
bound on the next heaviest path of length l that has
not been extended, and any edge that can join this path can
have weight at most w_max. Suppose that when the m-th heaviest
path of length l is seen, max(θ_l, B_l.topScore) + w_max is
updated to a value smaller than P.weight. It is guaranteed
that P is already in B_{l+1} and has the highest weight in that
buffer. In other words, when the threshold is smaller than
P.weight, the difference between the weight of P and the next
heaviest path of length l is more than w_max. Now, paths of
length l + 1 that can be created from heavier paths of length
l are already in the buffer, and no unseen path of length
l can be extended to create a path heavier than P. Therefore,
P is guaranteed to be the heaviest path. The preceding
arguments hold for top-k heaviest paths where k > 1. □

Algorithm 2 extends the heaviest paths in sorted order, to
avoid their repeated creation. However, since paths are extended
by random accesses, it is possible to create a path
twice, which is unnecessary. For instance, a path of length
l may be created while extending its heaviest sub-path and
again while extending its lightest sub-path of length l − 1.

5.1 Duplicate Minimization by Controlling Random Accesses

Algorithm NextHeavyPath extends a path of length l to
one of length l + 1 by appending all edges that are incident
on either end node of the path. Since these edges are not
accessed in any particular order, in the literature of top-k
algorithms, they are referred to as random accesses. In this
section, we develop a strategy for controlling random accesses
performed by Algorithm NextHeavyPath for minimizing
duplicates. Duplicate paths of length l + 1 can
be created either due to extending the same path of length
l, or by extending two different subpaths of the same path
of length l + 1. Our solution for avoiding duplicates of the
first kind is implementing every buffer as a sorted set. This
avoids propagation of duplicates during execution. Further,
the threshold update logic guarantees that if there is a copy
P' of some path P that is already in the buffer B_l, the path
will not be returned before its copy P' makes it to B_l. The
algorithm ensures that when P is returned, P' either does
not exist or has been constructed and eliminated.

In addition to eliminating duplicates that have been created,
we take measures to reduce their very creation. Suppose P is
a path of length l + 1 whose right sub-path of length l is the
heaviest path of length l and its left sub-path of length l is
the second heaviest path of length l. Since random accesses
are performed at both ends of a path, P will be created
twice, using each of the top-2 paths of length l.

One possible solution is to perform random accesses at one of 
the ends of a path. Although this prevents duplicate creation 
of P, it does not allow fully exploring the neighborhoods 
of heavier edges. For example, consider a path with edges 
{(a, b), (b, c), (c, d)} that is the heaviest path of length 3,
with (b, c) the heaviest in the graph. If the addition of edges 
is restricted to the beginning of the path, the construction 
of the heaviest path of length 3 will be delayed until (a, b) is 
observed. There can be graph instances for which this can 
happen at an arbitrary depth. Therefore, it is advantageous 
to extend paths on both ends, and we dismiss the idea of 
one-sided extension.

LEMMA 4. Suppose Q is the n-th heaviest path of length
l with (a, b) as its heaviest edge. No new path of length l + 1
can be created from Q by adding an edge which is heavier
than (a, b) using the NextHeavyPath method.

Proof. Let P be a new path of length l + 1 created from
Q. If P is derived for the first time, it cannot have a subpath
of length l that is heavier than Q. Otherwise, the heavier
subpath is one of the n − 1 paths created before Q. On
the other hand, adding any edge to the end of Q which is
heavier than (a, b) results in a path of length l + 1 that has a
subpath of length l heavier than Q. The lemma follows. □

Therefore, no new path of length l + 1 can be created by
adding an edge to Q that is heavier than the heaviest edge
of Q. This leads to the following theorem.

THEOREM 2. Using NextHeavyPath, every new path
of length l + 1 is created only by extending its heavier sub-path
of length l. No path is created more than twice. A path
of length l + 1 is created twice iff both its sub-paths of length
l have the same weight. □

The strategy for controlling random accesses embodied in
Theorem 2 can be generalized to a stronger strategy as
follows.

FACT 1. Using NextHeavyPath, given a path of length
l, no new path of length l + 1 can be created by adding an
edge to its rightmost node that is heavier than its leftmost
edge, or adding an edge to its leftmost node that is heavier
than its rightmost edge.

In the rest of the paper, we refer to the strategy for controlling
random accesses described in Fact 1 as the random access
strategy. Notice that Algorithm HeavyPath always employs
random access in addition to sorted access. Additionally, we
have the option of adopting (or not) the random access strategy
above for controlling when and how random accesses are
used. We have:

FACT 2. If all of the edge weights in the graph are distinct,
every path is created only once when the random access
strategy mentioned above is followed.

In the rest of this paper, we follow the random access strategy
of Fact 1 for performing random accesses unless otherwise
specified. In our experiments, we
measure the performance of HeavyPath both without and
with this strategy.
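The rule of Fact 1 can be written as a small filter on candidate extensions. The following Python sketch is illustrative only: the function names, the adjacency-dictionary representation and the three-edge test graph are assumptions made for this example, with weights chosen so that path (a, b, c) weighs 2 and path (b, c, d) weighs 1.001.

```python
def make_graph(weighted_edges):
    """Build an undirected adjacency dict {node: {neighbor: weight}}."""
    graph = {}
    for u, v, w in weighted_edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


def allowed_extensions(graph, path):
    """Return the one-hop extensions of `path` permitted by Fact 1.

    An edge may be added at the rightmost node only if it is no heavier
    than the leftmost edge, and at the leftmost node only if it is no
    heavier than the rightmost edge. `path` is a tuple of at least two nodes.
    """
    left_edge = graph[path[0]][path[1]]
    right_edge = graph[path[-2]][path[-1]]
    result = []
    for nbr, w in graph[path[-1]].items():      # extend at the rightmost node
        if nbr not in path and w <= left_edge:
            result.append(path + (nbr,))
    for nbr, w in graph[path[0]].items():       # extend at the leftmost node
        if nbr not in path and w <= right_edge:
            result.append((nbr,) + path)
    return result


g = make_graph([("a", "b", 1.0), ("b", "c", 1.0), ("c", "d", 0.001)])
print(allowed_extensions(g, ("a", "b", "c")))   # (a, b, c) may grow to (a, b, c, d)
print(allowed_extensions(g, ("b", "c", "d")))   # (b, c, d) may not be extended by (a, b)
```

With these weights, the path (a, b, c, d) is derived only from its heavier sub-path (a, b, c), so the duplicate derivation from (b, c, d) is never attempted.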

5.2 HeavyPath Example 

In this section, we present a detailed example that illustrates
the creation of paths by HeavyPath and Rank Join for
finding the heaviest path (i.e., k = 1) of length 3 for the
graph in Figure 3. The idea of using buffers was already
illustrated in an earlier figure, to which we refer below. Later in
Section 6, we present several examples of heavy paths found
by our algorithms for the applications of finding playlists
and topic paths.

HeavyPath first reads the heaviest edge (a, b) and then
extends it using a random access to edge (b, c) into the path
(a, b, c) of length 2. It then reads edge (b, c) again under
sorted access and tries to extend it via a random access to
(a, b). The duplicate derivation of the path (a, b, c) is caught
and discarded. Edge (b, c) is extended with another random
access into the path (b, c, d). At this point, paths (a, b, c) of
weight 2 and (b, c, d) of weight 1.001 are added to buffer B_2.
The threshold θ_2 at this point is 2 and is updated to 0.06
when the next edge (a', b') is visited under sorted access.
At this time, the two heaviest paths of length 2 are both
above the threshold and are returned. Of these two paths,
(a, b, c) is extended with a random access to edge (c, d) to
form a length 3 path. If we do not adopt the random access
strategy (see Fact 1, Section 5.1), then (b, c, d) will be similarly
extended and again the duplicate derivation would be
discarded. If we adopt the random access strategy, random
access is restricted to edges whose weight is no more than
that of the edges at either end of the path, so (b, c, d) will
not be similarly extended. Now, θ_2 is updated to 2 × 0.03
and θ_3 is updated to 1.06. The heaviest path of length 3
found so far, which has weight 2.001, is reported.

Figure 5: Paths created by Rank Join on the graph of Figure 3.

HeavyPath performs 5 joins in total before reporting the 
heaviest path of length 3. That includes the joins for paths of 
length 2 and 3 that are created, and the additional duplicate 
path that is created and removed during the execution. 

Figure 5 illustrates Rank Join for the same graph and parameter
settings. Rank Join is not able to produce the heaviest
length 3 path until it scans every edge under sorted access
for this graph instance. Each edge is joined twice with the
partial list of edges that are scanned before it, to construct
paths of length 3. Rank Join produces n + 1 paths of length 3
and two paths of length 2 in the order shown in Figure 5. It
performs a total of 2n + 4 join operations before finding the
heaviest length 3 path. This example demonstrates that the
performance of Rank Join can be significantly worse than
HeavyPath in terms of the number of join operations it
performs.

5.3 Memory-bounded Heuristic Algorithm 

As described in the earlier sections, the problem of finding
the heaviest path(s) is NP-hard. Even though Rank Join
uses a fixed buffer space by storing only the top-k heaviest
paths at any time, it needs to construct many paths and may
run out of allocated memory. On the other hand, HeavyPath
explicitly stores intermediate paths in its buffers, and
in doing so it may run out of memory (see Section 6 for the
performance of various algorithms and their memory usage).
From a practical viewpoint, having a memory-bounded
algorithm for HPP would be useful. To this end, we propose a
heuristic algorithm for HPP called HeavyPathHeuristic
that takes an input parameter C, which is the total number
of paths the buffers are allowed to hold, in all. If HeavyPath
fails to output the exact answer using the maximum
allowed collective buffer size C, its normal execution stops
and it switches to a greedy post-processing heuristic.
Algorithm 3 provides the pseudocode for this heuristic. The
post-processing requires only O(d_max × ℓ) additional memory,
where d_max is the maximum node degree in
the graph. This is negligible in most practical scenarios, but
should be factored in while allocating memory.

Algorithm 3 HeavyPathHeuristic(ℓ, B)
Input: Path length ℓ, buffers B from a run of HeavyPath
Output: a heavy path of length ℓ and ratio ρ
...
10: j = LastBufferIndex() // last non-empty buffer
11: q ← ⌊ℓ/(j − 1)⌋
12: r ← ℓ − q × (j − 1)
13: if r > 0 then
14:     U_ℓ ← U_{j−1} × q + U_r
15: else
16:     U_ℓ ← U_{j−1} × q
17: ρ = B_ℓ.topScore / U_ℓ
18: return (RemoveTopPath(B_ℓ), ρ)

Suppose j is the index of the last non-empty buffer (returned
by the routine LastBufferIndex) when HeavyPath runs
out of the total allocated buffer size C. At this time, the
heaviest path of length j − 1 has already been created and
extended. However, θ_j is still larger than the weight of the
heaviest path in B_j. HeavyPathHeuristic takes the current
heaviest path of length j and performs random accesses
to create longer simple paths. Note that these random accesses
do not follow the rule described in Section 5.1, in order
to have a better chance of finding heavier paths. Having
done this, it calls LastBufferIndex in a loop.
LastBufferIndex returns j + 1 if the previous path on the top
of B_j has been successfully extended to create at least one
path of length j + 1. The while loop continues until a path
of length ℓ is found. In case none of the existing paths in
B_j can be extended to create paths of length ℓ, it backtracks
to a buffer of shorter path length. The whole process is
guaranteed to run in O(C + d_max × ℓ) buffer size.

Measures               Cora    last.fm   Bay
Nodes                  70      40K       321K
Edges                  1580    183K      400K
Average Degree         22.6    4.5       1.2
Number of components   1       6534      1

Table 1: Summary of datasets

Empirical Approximation Ratio. In addition to returning
a path of length ℓ, this heuristic can estimate the worst-case
ratio between the weight of the path found and the
maximum possible weight of the heaviest path of length ℓ.
We call this the empirical approximation ratio. Let U_l denote
the maximum possible weight of a path of length l.
The empirical approximation ratio, denoted ρ, is defined as
ρ = B_ℓ.topScore/U_ℓ. If HeavyPath makes it to the j-th
buffer, the heaviest path of length l ≤ j − 1 has already
been found and can be used to provide an approximation ratio
along with the output. If j is the last non-empty buffer
when HeavyPath terminates, U_l is not known for l ≥ j.
Since the approximation ratio mentioned above is the worst
case, we refer to U_l as a pessimistic upper bound on the
weight of the heaviest path of length l, for l ≥ j. The main
idea behind this calculation is the following (see lines 11-17):
if ℓ is divisible by j − 1, then no path of length ℓ can
be heavier than U_{j−1} × (ℓ/(j − 1)), since U_{j−1} is an upper
bound on the weight of any path of length j − 1; if ℓ is not
divisible by j − 1, let r be the remainder of the division. By
the same reasoning as above, U_{j−1} × ⌊ℓ/(j − 1)⌋ + U_r is an
upper bound on the weight of any path of length ℓ.
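The calculation in lines 11-17 of Algorithm 3 can be sketched in Python as follows. The function names and the dictionary U, which maps each length l ≤ j − 1 to the weight of the heaviest path of that length found by the exact run, are assumptions made for illustration.

```python
def pessimistic_upper_bound(ell, j, U):
    """Upper bound on the weight of any path of length ell.

    j is the index of the last non-empty buffer (j >= 2), and U[l] is the
    known heaviest-path weight for every length l in 1..j-1.
    """
    base = j - 1
    q, r = divmod(ell, base)        # ell = q * (j - 1) + r, with 0 <= r < j - 1
    bound = U[base] * q             # q blocks of length j - 1
    if r > 0:
        bound += U[r]               # plus one leftover block of length r
    return bound


def empirical_ratio(found_weight, ell, j, U):
    """Ratio rho between the weight found and the pessimistic bound."""
    return found_weight / pessimistic_upper_bound(ell, j, U)


# Hypothetical values: exact heaviest weights are known up to length 3 (j = 4).
U = {1: 1.0, 2: 2.0, 3: 2.5}
print(pessimistic_upper_bound(7, 4, U))   # 2 blocks of length 3 plus 1 edge
print(empirical_ratio(2.5, 6, 4, U))
```

Because the remainder r is always smaller than j − 1, the value U[r] is one of the exactly known weights, so the bound never depends on an unknown quantity.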

6. EXPERIMENTAL ANALYSIS 
6.1 Experimental Setup 

Algorithms Compared. We implemented the algorithms
Dynamic Programming, Rank Join and HeavyPath (without
and with the random access strategy). We also implemented
the heuristic algorithm HeavyPathHeuristic and a simple
Greedy algorithm to serve as a quality baseline for
HeavyPathHeuristic. We evaluate our algorithms over three real
datasets: Cora, last.fm and Bay, summarized in Table 1.
The distributions of edge weights for the three datasets can
be found in Figure 6.

Cora. We abstract a topic graph from the Cora Research
Paper Classification dataset. Nodes in the topic graph
represent research topics into which research papers are classified,
and an edge between two topics a, b represents that a
paper belonging to topic a cited a paper belonging to topic b
or vice versa or both. The weight on an edge is computed as
the average of the fraction of citations from papers in topic
a to papers in topic b and vice versa. A heavy path in the
topic graph captures the flow of ideas across topics.

last.fm. The last.fm data was crawled using their API
service. Starting with the seed user "RJ" (Richard Jones was
one of the co-founders of last.fm), we performed a breadth-first
traversal and crawled 400K users. Of these, 163K users
had created at least one playlist, for a total of 173K playlists
with 1.1M songs in all. We use these playlists as a proxy for
listening sessions of users to build a co-listening graph. A
node in the co-listening graph is a song and an edge represents
that a pair of songs were listened to together in several
playlists. The weight of an edge is defined as the Dice
coefficient. We filtered out edges that had a Dice coefficient
smaller than 0.1. The graph obtained has 6534 connected
components, which implies that there are many pairs
of songs that are not heard together frequently. A heavy
path in the co-listening graph captures a popular playlist of
songs with a strong "cohesion" between successive songs.
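As an illustration of how such a co-listening graph might be built, the Python sketch below uses the standard Dice coefficient, 2 × |playlists containing both songs| / (|playlists containing the first| + |playlists containing the second|). The exact definition used for the dataset is not reproduced in this section, so this formula, the function name and the toy playlists are assumptions.

```python
from collections import Counter
from itertools import combinations


def co_listening_graph(playlists, min_dice=0.1):
    """Return {(song_a, song_b): dice} for pairs with dice >= min_dice."""
    song_count = Counter()          # number of playlists containing each song
    pair_count = Counter()          # number of playlists containing both songs
    for playlist in playlists:
        songs = sorted(set(playlist))           # ignore repeats within a playlist
        song_count.update(songs)
        pair_count.update(combinations(songs, 2))
    edges = {}
    for (a, b), both in pair_count.items():
        dice = 2 * both / (song_count[a] + song_count[b])
        if dice >= min_dice:                    # drop weakly related pairs
            edges[(a, b)] = dice
    return edges


toy_playlists = [["s1", "s2", "s3"], ["s1", "s2"], ["s1"], ["s3", "s4"]]
for pair, weight in sorted(co_listening_graph(toy_playlists).items()):
    print(pair, round(weight, 3))
```

The resulting dictionary of weighted edges has the same shape as the input assumed by the other sketches in this paper, so a heavy path over it corresponds to a sequence of songs that frequently share playlists.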



Bay. Our third dataset is a road network graph of the San
Francisco Bay area. It is one of the graphs used for the
9th DIMACS Implementation Challenge on Shortest
Paths. The nodes in the graph represent locations and
edges represent existing roads connecting pairs of locations.
The weight on an edge represents the physical distance
between the pair of locations it connects. We normalize the
weights as described in Section 2 to solve the lightest path
problem on this graph.



Implementation Details. All experiments were performed
on a Linux machine with 64GB of main memory and a 2.93GHz
CPU with 8MB cache. To be consistent, we allocated 12GB of
memory for each run, unless otherwise specified. The algo- 
rithms were implemented in Java using built-in data struc- 
tures (that are similar to priority queues) for implementing 
buffers as sorted sets. Our implementation of Rank Join 
was efficient in that it maintained a hash table of scanned 
edges to avoid scanning the edge list every time a new edge 
was read under sorted access. Similarly, for efficient random 
access, we maintained the graph as a sparse matrix. 

6.2 Experimental Evaluation 

Running time for varying path lengths. We compare
the running times of the four algorithms proposed in
Section 3 for solving HPP. Figures 7(a), 7(b), and 7(c) show the
running time for finding the top-1 path of various lengths
for the Cora, last.fm and Bay datasets respectively. The running
time increases with the path length for all algorithms,
as expected. For all three datasets, the Dynamic Programming
algorithm is orders of magnitude slower than the other
algorithms even for short paths of length 2. For Cora,
Dynamic Programming took over a day to compute the heaviest
path of length 6, albeit using only 4GB of the allocated
12GB of memory. In all cases, except for ℓ = 3 on Cora (see
Figure 7(a)), Rank Join is slower than HeavyPath, with
the difference in running times of the two methods increasing
with path length. We investigated the single instance
where Rank Join was faster than HeavyPath and noticed
that for short paths that contain high degree nodes,
performing random accesses during HeavyPath can result in
extra costs compared with Rank Join. After ℓ = 4 on Cora
and last.fm, and ℓ = 11 on Bay, Rank Join runs out of the
allocated memory and quits. In contrast, HeavyPath is able
to compute the heaviest paths of length 8 on Cora, 7 on
last.fm, and 36 on Bay before running out of the allocated
memory. All algorithms are faster and can compute longer
paths on Bay as compared with Cora and last.fm. Both
the structure of the graph (especially the average degree)
and the distribution of the edge weights play a role in the
running time and memory usage. The Bay dataset is the
largest of the three datasets, but has a small (i.e., 1.2) average
node degree, which makes it easier to traverse/construct
paths. The Cora dataset has only 70 nodes, but is a dense
graph that includes nodes which are connected to all other
nodes. In comparison, the last.fm graph has 40K nodes and
average node degree 4.5.

Figure 6: Edge weight distributions for the Cora, last.fm and Bay datasets

Figure 7: Running time comparisons for exact algorithms with different parameter settings. Panels include (a) vary ℓ, k = 1, Cora; running time on the Cora dataset for ℓ = 4; running time on the last.fm dataset for ℓ = 4.

On analyzing these results, we make the observation that 
average node degree and edge weight distribution are the 
main parameters that define the hardness of an instance. 
In fact, our smallest dataset in terms of number of nodes 
and edges, Cora, is the most challenging of all. Polyzotis et 
al. [17] make similar observations in their experiments on
the Rank Join problem. 



Running time for varying number of top-k paths.
Figures 7(d), 7(e), and 7(f) show the running times for finding
top-k paths of a fixed length, while varying k from 1 to 100.
We chose ℓ = 4 for Cora and last.fm and ℓ = 10 for Bay as
those were the longest path lengths for which all algorithms
produced an output within the allocated memory. Since
Dynamic Programming is clearly very slow even for k = 1, we
focus on comparing the other algorithms in this and subsequent
experiments. Rank Join shows interesting behavior
with increasing k. For smaller values of k, e.g., k < 50
for last.fm (see Figure 7(e)), the running time continues to
increase as k increases. When k is increased further, the running
time does not change significantly. When constructing
paths, if the next heaviest path has already been constructed
by Rank Join, it can output it immediately, and that seems
to be the case for large values of k. Recall that Rank Join
may build a subpath many times while constructing paths
that subsume it. The increase in running time for HeavyPath
as k increases is insignificant. Since the algorithm
builds on shorter heavy paths, several paths of length less
than ℓ may already be in the buffers and can be extended
to compute the next heaviest path.



Impact of Random Access Strategy. Figure 8 compares
HeavyPath without and with the random access strategy
adopted and shows the additional benefit of employing the
random access strategy. Consistently across all datasets, we
observe that using the random access strategy speeds up the
algorithm execution. The speedup is greater for longer paths.

Interpreting heavy paths. We present a few representative
examples to show the application of HPP to generating
playlists and finding the flow of ideas, as discussed in Section 1.

The heaviest path returned by HeavyPath on last.fm with
ℓ = 5 has the following songs: "Little Lion Man, Sigh No
More, Timshel, I Gave You All, Winter Winds". We noticed
that all these songs are from the "folk rock" genre and by the
band "Mumford & Sons". Interestingly, the heaviest path of
ℓ = 7 comprised a completely different set of songs:
"Put It On The Air, Beach House, Substitute For Murder,
Souvenir, Ease Your Mind, Novgorod, Sun God, Nemo" from
the "art rock" genre, by the band "Gazpacho". On investigating
further, we noticed a trend in how users of last.fm create
playlists, which is to add a collection of top/latest songs from
a certain artist/band. Since the last.fm graph used in our
experiments was abstracted from such playlists, we see paths
corresponding to songs by the same artist. It would be interesting
to see the paths that emerge from graphs abstracted
from actual user listening history, which may potentially be
more diverse. We leave such exploration for future work.

We analyzed the "topic paths" that represent the flow of ideas in
the Cora graph of topics extracted from research paper
citations. The top path of ℓ = 3 had the topics: "Cooperative
Human Computer Interaction (HCI), Distributed Operating
Systems (OS), Input, Output and Storage for Hardware
and Architecture". It turns out that in this dataset, there
are several papers about applications like group chat and
distributed white boards that involve collaborative computing.
Such applications are concerned with both the reliability
of distributed servers and the real-time end user experience.
Hence, the topic path obtained from the graph highlighted
this less obvious connection. Longer heavy paths also exhibit
interesting patterns. For instance, this heaviest topic
path of length 7: "Object Oriented Programming, Compiler
Design, Memory Management in OS, Distributed OS,
Cooperative HCI, Multimedia and HCI, Networking Protocols,
Routing in Networks" corresponds to a surprising and
non-trivial flow of ideas across a chain of topics.

Quality of paths using the heuristic approach.
Figure 9 shows the path weight obtained using
HeavyPathHeuristic as ℓ increases, for k = 1. The exact result as
obtained by HeavyPath is plotted for comparison, along with
an "upper bound" that represents the best possible score of a
path for a given ℓ, as detailed in Section 5.3. As a baseline, we
also plot the path resulting from a purely greedy approach
that builds a path of length ℓ by starting from
the heaviest edge and repeatedly following heaviest edges
out of the end nodes. For the Cora dataset in Figure 9(a),
the memory-bounded heuristic is no worse than 50% of the
theoretically best possible path weight, and it finds heavier
paths of a given length compared with the greedy approach.
Recall that the theoretically best possible path weight is
estimated by the upper bound U_ℓ, as explained in Section 5.3.
The approach can be used to find a path connecting all nodes
in the graph (ℓ = 69). The results for the last.fm dataset
in Figure 9(b) show that the heuristic approach is able to
compute long paths, with the accuracy decreasing
with path length. The greedy approach stalled after
ℓ = 40. The last.fm dataset has 6534 components, while
the other datasets have a single connected component. When
the greedy approach starts with the heaviest edge, and the
component to which it belongs has fewer than 40 nodes, the greedy
algorithm restarts the process with the next heaviest edge.
This results in a lot of exploration before the algorithm can
produce a heavy path of ℓ > 40. The heuristic manages to
find paths of length up to 100. Even at that length, the weight
of the path it finds is about 37.5% of the theoretically best
possible. Interestingly, the heuristic approach finds paths
whose weight is very close to that of the best possible for
the Bay dataset, as seen in Figure 9(c).
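The greedy baseline described above can be sketched in Python as follows. The function name and the dictionary-of-edges input are assumptions, and details such as tie-breaking are not specified in the text, so this is one plausible reading rather than the implementation used in the experiments.

```python
def greedy_heavy_path(edges, ell):
    """Greedy baseline: grow a simple path from the heaviest edge.

    edges: {(u, v): weight} for an undirected graph. Returns (weight, nodes)
    for a simple path with ell edges, or None if no start edge succeeds.
    """
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, []).append((w, v))
        adj.setdefault(v, []).append((w, u))
    # Try start edges from heaviest to lightest, restarting when stuck.
    for (u, v), w in sorted(edges.items(), key=lambda kv: -kv[1]):
        path, total = [u, v], w
        while len(path) - 1 < ell:
            on_path = set(path)
            candidates = [(wt, nbr, end)
                          for end in (0, -1)            # both end nodes
                          for wt, nbr in adj[path[end]]
                          if nbr not in on_path]
            if not candidates:
                break                                   # dead end: restart
            wt, nbr, end = max(candidates, key=lambda c: c[0])
            if end == 0:
                path.insert(0, nbr)
            else:
                path.append(nbr)
            total += wt
        if len(path) - 1 == ell:
            return total, path
    return None


toy = {("a", "b"): 100, ("b", "c"): 100, ("c", "d"): 1, ("a", "e"): 3,
       ("e", "f"): 2, ("d", "f"): 50, ("b", "e"): 40}
print(greedy_heavy_path(toy, 3))
```

The restart loop mirrors the behavior reported for last.fm: when the component containing the current start edge is too small, the procedure falls back to the next heaviest edge, which can require many restarts on a graph with thousands of small components.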



Figure 9(d) shows the empirical approximation ratio ρ as the
path length increases, keeping the allocated memory fixed
(as with Figure 9(a)) at 12GB. As the path length increases,
ρ decreases. This is expected: given the same bootstrap from
the exact algorithm, the estimate of the heuristic worsens
with respect to the best possible weight. It should be noted
that ρ is not a ratio w.r.t. the optimal solution, but rather a
conservative estimate of the worst-case ratio of the result of
our algorithm on a given graph instance.

Memory for the heuristic approach. When the path
length is fixed, and the memory allocated to the heuristic
approach is increased, ρ increases. Figure 9(e) illustrates
this for the Cora dataset, with ℓ = 25, k = 1 and memory
allocation varied to store 10 to 1 million paths. It is worth
noting that even with as little memory as needed to hold
5,000 paths, ρ is already at 0.6, and with 250,000 paths, it
reaches 0.7. Further improvement is limited.



Running time for the heuristic approach. Figure 9(f)
shows the running time of the heuristic as well as of HeavyPath
and Greedy on the last.fm dataset. HeavyPathHeuristic
takes a negligible amount of time for post-processing,
even for long paths (shown up to ℓ = 100). Greedy does
not scale beyond ℓ = 40. The other datasets show similar
patterns for HeavyPath and HeavyPathHeuristic, while
Greedy takes only tens of milliseconds on the other datasets.

6.3 Discussion

We observed in Figure 7(a) that for one parameter setting,
ℓ = 3 on the Cora dataset, Rank Join outperformed HeavyPath
in terms of running time for finding the top-1 path.
We found that the random accesses performed for extending
paths ending in high degree nodes resulted in a slower
termination. For instance, consider the sub-graph of the
graph in Figure 3 induced by the nodes a', b', c', d'_1, ..., d'_n,
but with the following edge weights: w(a', b') = 1, w(b', c') = 1,
w(c', d'_1) = 1, and all other edge weights to nodes d'_2, ..., d'_n
the same at 0.01. Rank Join would scan the 3 top edges
and terminate (since the top path weight equals the threshold),
while HeavyPath will scan n − 1 additional edges by
performing random accesses. A potential enhancement to
HeavyPath is to perform the random accesses in a "lazy"
fashion. In particular, perform random access on demand,
and in a sorted order of non-increasing edge weights. Such
an approach would avoid wasteful random accesses to edges
that have very low weight.

While our exact algorithms already scale much better than
the classical Rank Join approach, our heuristics take only
tens of milliseconds to compute results that are no worse
than 50% of the optimal for path lengths up to 50 on all
datasets we tested. It is worth noting that heuristics for the
problem of TSP have been extensively studied for decades
(see http://www2.research.att.com/~dsj/chtsp/).
Although these are not directly applicable to HPP, it would
be interesting to explore the ideas employed in those heuristics
to possibly get higher accuracy.

We conducted various additional experiments focusing on
metrics such as the number of edge reads performed, the number of
paths constructed, and the rates at which the termination
thresholds used by the algorithms decay. The details can be
found in Appendix B.

7. RELATED WORK

The problem of finding heavy paths is well suited for enabling
applications as described in Section 1. Recently, there
has been interest in the problem of itinerary planning [3, 18].
Choudhury et al. model it as a variation of the
orienteering problem, and in their setting the end points of
the tour are given, making the problem considerably simpler
than ours. Xie et al. [18] study it in the context of generating
packages (sets of items) under user-defined constraints,
e.g., an itinerary of points of interest with 2 museums and
2 parks in London. However, the recommendations that are
returned to the user are sets of items which do not capture
any order. In both these papers, the total cost of the POI
tour is subject to a constraint and the objective is to maximize
the "value" of the tour, where the value is determined
by user ratings. In contrast, by modeling itinerary finding
as an HPP problem, we aim to minimize the cost of the tour
(through high value items), which is technically a different
problem. Another related work is [4] on generating a ranked
list of papers, given query keywords.

In [8], Hansen and Golbeck make a case for recommending 
playlists and other collections of items. The AutoDJ system 
of Platt et al. [15] is one of the early works on playlist generation
that uses acoustic features of songs. There is a large
number of similar services - see for comparison of some 
of these services. To our knowledge, playlist generation as 
an optimization problem has not been studied before. 

The HPP was studied recently by [2] for the specific applica- 
tion of finding "persistent chatter" in the blogosphere. Un- 
like our setting, the graph associated with their application 
is ℓ-partite and acyclic. As mentioned in the introduction,
adaptations of their algorithms to general graphs lead to 
rather expensive solutions. 

As mentioned earlier, the HPP problem can be mapped to 
a length-restricted version of TSP known as ℓ-TSP, where ℓ
is the length of the tour. ℓ-TSP is NP-hard and inapprox-
imable when triangular inequality doesn't hold pp. For the 
special Euclidean case, there is a 2-approximation algorithm 
due to Garg et al. [B]. We propose a practical solution based 
on the well established Rank Join framework for efficiently 
finding the exact answer to this problem for reasonable path 
lengths. To the best of our knowledge, this solution is novel.

Rank Join was first proposed by Ilyas et al. [101 1111 [T5] to 
produce top-k join tuples in a relational database. They
proposed the logical rank join operator that enforces no re- 
striction on the number of input relations. In addition, they 
describe HRJN, a binary operator for two input relations, 
and a pipelining solution for more than two relations since a 
multi-way join is not supported well by traditional database 
engines. It is well known that the multi-way Rank Join oper- 
ator is instance optimal, however similar optimality results 
are not known for iterated applications of the binary opera- 
tor [T71 [T^ . Therefore our comparison of HeavyPath with 
the multi-way Rank Join approach, to establish both our 
theoretical and empirical claims, is well justified. 
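
To make the threshold idea behind a binary rank join concrete, the following is a minimal sketch of an HRJN-style self-join that builds the top-k heaviest simple paths with two edges. It is an illustration under stated assumptions, not the operator of Ilyas et al. nor HeavyPath: the function name `rank_join_two_edge_paths`, the dict-based edge representation, the lockstep pulling from both inputs, and the use of sum as the aggregate are all choices made here, and no random access is used. Node labels are assumed to be mutually comparable (e.g., strings) so that ties in the heap can be broken.

```python
import heapq
from collections import defaultdict

def rank_join_two_edge_paths(edges, k):
    """Top-k heaviest simple 2-edge paths via a binary rank self-join.

    edges: dict mapping a directed edge (u, v) to a non-negative weight.
    Returns (results, depth): results is a list of (weight, (a, b, c)) in
    non-increasing weight order; depth is how many edges were read per input.
    """
    ranked = sorted(edges.items(), key=lambda item: item[1], reverse=True)
    if not ranked or k <= 0:
        return [], 0
    top_weight = ranked[0][1]
    left_by_dst = defaultdict(list)   # seen left edges, indexed by end node
    right_by_src = defaultdict(list)  # seen right edges, indexed by start node
    buffer = []                       # max-heap of candidates (negated weight)
    results = []
    depth = 0
    for (u, v), w in ranked:
        depth += 1
        # Sorted access on the left input, joined with right edges seen so far.
        left_by_dst[v].append((u, w))
        for x, w_right in right_by_src[v]:
            if len({u, v, x}) == 3:   # keep only simple paths
                heapq.heappush(buffer, (-(w + w_right), (u, v, x)))
        # Sorted access on the right input, joined with left edges seen so far.
        right_by_src[u].append((v, w))
        for a, w_left in left_by_dst[u]:
            if len({a, u, v}) == 3:
                heapq.heappush(buffer, (-(w_left + w), (a, u, v)))
        # Any path not yet built uses an unseen edge (weight <= w) plus some
        # other edge (weight <= top_weight), so it weighs at most this bound.
        threshold = top_weight + w if depth < len(ranked) else float("-inf")
        while buffer and -buffer[0][0] >= threshold:
            neg_weight, path = heapq.heappop(buffer)
            results.append((-neg_weight, path))
            if len(results) == k:
                return results, depth
    return results, depth
```

On a small graph where one edge is much heavier than the rest, the sketch can stop after reading only a prefix of the sorted edge list, because the buffered candidate already meets the threshold. Since both inputs are the same sorted list read in lockstep, the two usual HRJN threshold terms coincide here.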

The seminal work on Rank Join [10] already mentions the
potential usefulness of random accesses; however, it does not
evaluate them theoretically or empirically. A recent study [14]
proposes a cost-based approach as a guideline for determin- 
ing when random access should be performed in the case of 
binary Rank Join. As such, their results are not directly ap- 
plicable for HPP. In contrast to all the prior work on Rank 
Join, we address the specific question of finding the top- 
k heaviest paths in a weighted graph and adapt the Rank 
Join framework to develop more efficient algorithms. Our 
techniques leverage the fact that the join is a self-join, store 
intermediate results and use random accesses to aggressively 
lower the thresholds and facilitate early termination. Be- 
sides, we develop techniques for carefully managing random 
accesses to minimize duplicate derivations of paths. Fur- 
thermore, we establish theoretical results that are in favor 
of using random access for repeated binary rank self-joins. 

8. CONCLUSIONS AND FUTURE WORK 

Finding the top-k heaviest paths in a graph is an important
problem with many practical applications. This problem is
NP-hard. In this paper we focus on developing practical exact
and heuristic algorithms for this problem. We identify
its connection with the well-known Rank Join paradigm and 
provide insights on how to improve Rank Join for this spe- 
cial setting. To the best of our knowledge, we are the first 
to identify this connection. We present the HeavyPath al- 
gorithm that significantly improves a straightforward adap- 
tation of Rank Join by employing and controlling random 
accesses and via more aggressive threshold updating. We 
propose a practical heuristic algorithm which is able to pro- 
vide an empirical approximation ratio and scales well both 
w.r.t. path length and number of paths. Our experimental 
results suggest that our algorithms are both scalable and 
reliable. Our future work includes improving memory us- 
age and scalability of our exact algorithms, exploration and 
adaptation of ideas in [14] for improving the performance 
further, and application of our algorithms in recommender
systems and social networks. It is also interesting to investigate
top-k algorithms with probabilistic guarantees for the
heavy path problem.

APPENDIX

A. DYNAMIC PROGRAMMING DETAILS 

Essentially, dynamic programming constructs all simple paths
of length ℓ in order to find the heaviest among them. But
unlike DFS, it aggregates path segments early, thus achieving
some pruning. More precisely, here are the equations of
the dynamic program. Letters i, j, y denote variables which
will be instantiated to actual graph nodes when the program
is run. S denotes the avoidance set. The variable l will be
instantiated in the range [2, ℓ].

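\[
P^{l}_{i,j,S} \;=\; \mathrm{MAX}_{\,y \,:\, (y,j) \in E,\; y \notin S,\; j \notin S}
\left\{ P^{l-1}_{i,y,S \cup \{j\}} \circ e(y,j) \right\}
\]

\[
P^{1}_{i,j,S} \;=\;
\begin{cases}
\mathrm{CREATE}(P, e(i,j)) & \text{if } (i,j) \in E \text{ and } i, j \notin S \\
\mathrm{NULL} & \text{otherwise}
\end{cases}
\]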

In the above equations, we can think of $P^{l}_{i,j,S}$ as a "path
object" with properties path and weight. Here, $P^{l}_{i,j,S}$.path
denotes the heaviest S-avoiding path from i to j of length
l, and $P^{l}_{i,j,S}$.weight denotes its weight. The operator MAX
takes a collection of path objects and finds the object with
maximum weight among them. The "∘" operator takes a
P object and an edge e(u, v), concatenates the edge with
P.path and updates P.weight by adding the weight of
the edge. Finally, CREATE(P, e(i, j)) creates a path object P
whose path property is initialized to the edge (i, j) and whose
weight property is initialized to the edge weight of (i, j).

We invoke the dynamic program above using $P^{\ell}_{i,j,\emptyset}$ for every
node j ∈ V, where we have left the start node as a variable
i. The heaviest path of length ℓ in the graph is the heaviest
among the paths found above for various j.
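
The recurrence described above can be sketched in a few lines of Python. This is only a minimal illustration under assumptions made here: path length is counted in edges, the graph is directed and given as a dict from (u, v) pairs to weights (an undirected graph would list both directions), and the names `heaviest_path_dp` and `best` are hypothetical. The memoised subproblems are keyed by (l, j, avoidance set), with the start node left free as in the text.

```python
from functools import lru_cache

def heaviest_path_dp(edges, length):
    """Heaviest simple path with `length` edges, as (weight, nodes), or None.

    edges: dict mapping a directed edge (u, v) to a non-negative weight.
    """
    preds = {}   # node -> list of (predecessor, edge weight)
    nodes = set()
    for (u, v), w in edges.items():
        preds.setdefault(v, []).append((u, w))
        nodes.update((u, v))

    @lru_cache(maxsize=None)
    def best(l, j, avoid):
        # Heaviest simple path with l edges that ends at j and uses no node
        # in `avoid`; the start node is left free.
        if j in avoid:
            return None
        candidates = []
        for y, w in preds.get(j, []):
            if y == j or y in avoid:
                continue
            if l == 1:
                candidates.append((w, (y, j)))                  # CREATE step
            else:
                sub = best(l - 1, y, avoid | {j})               # j must now be avoided
                if sub is not None:
                    candidates.append((sub[0] + w, sub[1] + (j,)))  # concatenate edge
        return max(candidates, key=lambda c: c[0], default=None)    # MAX step

    if length < 1:
        return None
    finals = [best(length, j, frozenset()) for j in nodes]
    finals = [p for p in finals if p is not None]
    return max(finals, key=lambda c: c[0], default=None)
```

Because the avoidance set is part of the subproblem key, the number of distinct subproblems can grow exponentially with the path length, which is consistent with the problem being NP-hard; the sketch is meant for small graphs only.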

For finding top-k heaviest paths of a given length for k > 1,
all we need to do is define the MAX operator so that it
works in an iterative mode and finds the next heaviest path
segment (ending at a certain node, of a given length, and
avoiding a certain set of nodes). We can apply this idea
recursively and easily extend the dynamic program for finding
top-k heaviest paths of a given length.
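
One simple way to realise this extension is sketched below. Note that it is an eager variant chosen here for brevity, not the iterative MAX operator described above: each subproblem keeps its k heaviest path segments instead of producing the next heaviest one on demand. This is sufficient because extending a segment by a fixed edge adds the same weight to every candidate. The function name `top_k_heavy_paths`, the edge dict format, and counting length in edges are assumptions of this sketch.

```python
import heapq
from functools import lru_cache

def top_k_heavy_paths(edges, length, k):
    """Up to k heaviest simple paths with `length` edges, heaviest first.

    edges: dict mapping a directed edge (u, v) to a non-negative weight.
    Returns a list of (weight, nodes) pairs.
    """
    preds = {}   # node -> list of (predecessor, edge weight)
    for (u, v), w in edges.items():
        preds.setdefault(v, []).append((u, w))
    nodes = {n for edge in edges for n in edge}

    @lru_cache(maxsize=None)
    def best_k(l, j, avoid):
        # The k heaviest simple paths with l edges ending at j that avoid
        # every node in `avoid`.
        if j in avoid:
            return ()
        candidates = []
        for y, w in preds.get(j, []):
            if y == j or y in avoid:
                continue
            if l == 1:
                candidates.append((w, (y, j)))
            else:
                for sub_weight, sub_path in best_k(l - 1, y, avoid | {j}):
                    candidates.append((sub_weight + w, sub_path + (j,)))
        return tuple(heapq.nlargest(k, candidates, key=lambda c: c[0]))

    if length < 1 or k < 1:
        return []
    finals = [c for j in nodes for c in best_k(length, j, frozenset())]
    return heapq.nlargest(k, finals, key=lambda c: c[0])
```

With k = 1 this reduces to finding the single heaviest path; a lazy, iterator-based MAX would avoid materialising k segments for subproblems that are never asked for more than their best one.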

B. ADDITIONAL EXPERIMENTS 

Our main result is that HeavyPath is algorithmically superior
to all compared algorithms including Rank Join, in
terms of the amount of work done. We establish this claim at
multiple levels, by performing additional experiments that
include comparing the number of edges read, the paths created,
and the rate at which the thresholds decay. In all these
experiments, we set k = 1, i.e., we focused on the top path.



Edge Reads. We measured the total number of edge reads 
performed by HeavyPath and by NextHeavyPath, where
each read of an edge, whether under sorted access, under random
access, or while performing joins to construct paths, is counted as
one. We show the results in Figure 10. The # edge reads
for NextHeavyPath exceeds 10^° for path lengths beyond 
4 on Cora and last.fm and beyond 11 on Bay, which then 
runs out of memory. On the other hand, HeavyPath is able 
to go up to lengths 8, 7, and 36 on the three respective data 
sets, incurring far fewer edge reads. The reason for running 
out of memory has to do with the number of paths computed 
and examined, investigated next. 

Paths Constructed. We counted the number of paths. In 
case of Rank Join, we only counted paths of length ℓ where
ℓ is the required path length. In case of HeavyPath, we
counted paths of all lengths < ℓ that are constructed and
stored by it in its buffers. The results are shown in Figure 13.
The relative performance of HeavyPath and Rank
Join w.r.t. the number of paths computed by them is con- 
sistent with the trend observed in the previous experiment 
on # edge reads, showing HeavyPath ends up doing signif- 
icantly less work than Rank Join and hence is able to scale 
to longer lengths. 

Threshold Decay. Both algorithms rely on their stopping
threshold, θ for Rank Join and θ_ℓ for HeavyPath, for early
termination. We measured the rate at which the thresholds
decay since that gives an indication of how quickly the algorithm
will terminate. Figure 11 shows the results. For each
algorithm, the threshold value is shown until the algorithm
terminates, successfully finding the heaviest path. In all cases,
the threshold θ_ℓ of HeavyPath decays
extremely fast, whereas in comparison, the threshold θ used
by Rank Join decays much more slowly. Since HeavyPath
employs random accesses whereas Rank Join doesn't, it is
clear that random access is responsible for the rapid decay of
the threshold, resulting in significant gain in performance.
Note that the three experiments above are
unaffected by systems issues, including garbage collection.

Running time. Finally, we separated the total running
time into time spent on garbage collection and the rest.
Let's call the rest "compute time" for convenience. We compare
the total time for HeavyPath with the compute time
for the other algorithms. It is worth noting here that for
the majority of the points reported in our results (which of
course correspond to those cases where the appropriate algorithm
terminated successfully), garbage collection time was
either zero or very small. The result is shown in Figure 12. For
simplicity, we "time out", i.e., stop the experiment, when- 
ever the time taken exceeds 10^ seconds. It can be observed 
that despite taking out the (small to negligible amount of) 
garbage collection from the total times of other algorithms, 
the total time of HeavyPath is still significantly less than 
the compute times of other algorithms. 

In each of these experiments, we can observe that HeavyPath
significantly outperforms other algorithms. These experiments
clearly establish HeavyPath's superiority
over the other algorithms in terms of the amount of work
done, i.e., in purely algorithmic terms, regardless of systems
issues, and show that our findings are not in any way compromised
by garbage collection.

We measured garbage collection time using the jstat tool and would
be happy to provide additional details on this if required.

Figure 12: Comparing total time of HeavyPath to compute time of other algorithms

Figure 13: Comparing the number of paths created by HeavyPath and Rank Join



